DISCRIMINANT ANALYSIS WITH ADAPTIVELY 
POOLED COVARIANCE 

NOAH SIMON AND ROB TIBSHIRANI 

Abstract. Linear and Quadratic Discriminant analysis (LDA/QDA) 

are common tools for classification problems. For these methods 
we assume observations are normally distributed within group. We 
estimate a mean and covariance matrix for each group and classify 
using Bayes theorem. With LDA, we estimate a single, pooled co- 
variance matrix, while for QDA we estimate a separate covariance 
matrix for each group. Rarely do we believe in a homogeneous 
covariance structure between groups, but often there is insufficient 
data to separately estimate covariance matrices. We propose ii- 
PDA, a regularized model which adaptively pools elements of the 
precision matrices. Adaptively pooling these matrices decreases 
the variance of our estimates (as in LDA), without overly biasing 
them. In this paper, we propose and discuss this method, give an 
efficient algorithm to fit it for moderate sized problems, and show 
its efficacy on real and simulated datasets. 

Keywords: Lasso, Penalized, Discriminant Analysis, Interac- 
tions, Classification 



1. Introduction 

Consider the usual two class problem: our data consists of n ob- 
servations, each observation with a known class label G {1,2}, and p 
covariates measured per observation. Let y denote the n-vector corre- 
sponding to class (with rii observations in class 1 and n2 in class 2), 
and X, the n by p matrix of covariates. We would like to use this 
information to classify future observations. 

We further assume that, given class y{l), each observation, Xi, is inde- 
pendently normally distributed with some class specific mean /^^^(i) G MP 
and covariance and that y{l) has prior probability tti of coming 

from class 1 and 7r2 from class 2. Prom here we estimate the two mean 
vectors, covariance matrices, and prior probabilites and use these esti- 
mates with Bayes theorem to classify future observations. In the past 
a number of different methods have been proposed to estimate these 
parameters. The simplest is Quadratic Discriminant Analysis (QDA) 
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which estimates the parameters by their maximum hkehhood estimates 



n 



/Ufc = — > Xi 
y(l)=k 



and 

Sfe = — [xi- Ilk) [xi - ^,kf . 
rii. ^-^ 

y{l)=k 

To classify a new observation x, one finds the class with the highest 
posterior probability. This is equivalent in the two class case to con- 
sidering 

D{x) = log ( — ) - ^ (x - /ii)"^ S^;^ (x - /ii) 



2 

+ ^ (x - fi^V ^2' - /i2) + logdet (s-'/'sf ) 
and if D{x) > then classifying to class 2, otherwise to class 1. 

Linear Discriminant Analysis (LDA) is a similar but more commonly 
used method. It estimates the parameters by a restricted MLE — the 
covariance matrices in both classes are constrained to be equal. So, for 
LDA 

1 " 

±1 = ±2 = -J2 i^i - /^2/(o) i^i - f'ywV 
1=1 

While one rarely believes that the covariance matrices are exactly equal, 
often the decreased variance from pooling the estimates greatly out- 
weights the increased bias. 



Friedman 1989 proposed Regularized Discriminant Analysis (RDA) 
noting that one can partially pool the covariance matrices and find a 
more optimal bias/variance tradeoff. He estimates by a convex 
combination of the LDA and QDA estimates 

tk = AS^°^ + (1 - A)SL°^ 

where A is generally determined by cross-validation. 

We extend the idea of partially pooling the covariance matrices in a 
different direction. We make the further assumption that for most i,j, 
^ (T,2^).j; that the partial covariance matrices are mostly 
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element-wise equal (or nearly equal). Intuitively this says that condi- 
tional on all other variables, most pairs of covariates interact identically 
in both groups. 

Given this assumption, the natural approach is to find a restricted 
MLE where the number of non-zero entries in S^^ — ^ is constrained 
to be less than some c. ie. to find 

argmax Si) + £2(/U2, ^2) 

s.t. ||Sr^-S2^||^<c 

Si,S2 Positive Semi-Definite 

where ik is the Gaussian log likelihood of the observations in class k, 
4(/ifc, Sfc) = -y log(27r)-hy logdet ^ {xi - fikV {xi - fik) 

y(l)=k 

and II ■ II is the number of nonzero elements. Unfortunately, this prob- 
lem is not convex and would require a combinatorial search. Instead 
we consider a convex relaxation 

(1) argmax Si) -h £2(/U2, S2) 

(2) s.t. ||S^^ - S2^||^ < c 

(3) Si, S2 Positive Semi-Definite 

where || ■ ||i is the sum of the absolute value of the entries. Because the 
I ■ I is not differentiable at 0, solutions to ([T| have few nonzero entries in 
S^^ — S^^ with the sparsity level dependent on c. There is a large liter- 



ature about using £1 penalties to promote sparsity (Tibshirani 1996 



Chen et al. 1998 , among others), and in particular sparsity has been 



applied in a similar framework for graphical models Banerjee et al. 
|2008|. Also recently, a very similar model to that which we propose 
has been applied to joint estimation of partial dependence among many 
similar graphs Danaher et al. , 2011|. The astute reader may note that 



([T]) is not jointly convex in /i and S^^. However, we can still find the 
global maximum — for fixed /Xi and /i2 it is convex, and, as we later 
show, our estimates of /ii and ^2 are completely independent of our 
estimates of Si, and S2. 

The problem ([T]) has an equivalent Lagrangian form (which we will 
write as a minimization for future convenience) 

(4) argmin - ^(/ii. Si) - £2(yU2, S2) + A ||Sj;^ - S2"^ ||^ 

(5) s.t. Si, S2 Positive Semi-Definite 
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This is the objective which we will focus on in this paper. We will call 
its solution "£i Pooled Discriminant Analysis" (£i-PDA). For A = 
these are just QDA estimates and for A sufficiently large, just LDA 
estimates. 



In this paper, we examine the £i-PDA objective; we discuss the 
connections between £i-PDA and estimating interactions in a logistic 
model; we show the efficacy of £i-PDAon real and simulated data; and 
we give an efficient algorithm to fit £i-PDA based on the alternating 
direction method of moments (ADMM). 

1.1. Reductions. One may note that our objective Q is not jointly 
convex in Hk and S^, however this is not a problem (the optimization 
splits nicely). For a fixed Si, yU* minimizes 



2 

2/(0=1 



This is true iff 



sr^ J2 (xz-/xi) = o. 

2/(0=1 

Thus, fil = Xi = a;; is the sample mean from class 1, and 

similarly is the sample mean from class 2. We can simplify our 
objective Q by substituting the sample means in for /i* and and 
noting that 



[xi - Xi 



{xi - xi)^ Si ^ {xi - xi) = ^ ^ tr ^{xi - xi)^ Si ^ 

0=1 

tr I^S^;^ {xi - Xi) {xi - Xi)^ /ni 



2/(0=1 2/(0=1 

= nx 



2/(0=1 

= nitr S^;^ ^ (x; - xi) (x/ - xi)"^ /rii 

2/(0=1 

= n\ tr [S|f "'"S'l] . 

where Si is the sample covariance matrix for class 1. 
Our new objective is 

(6) minsi,s2 - "^x logdet (S^^) + nx tr(Si;^S'i) - n2 logdet (S2 ^) 

(7) +n2tr(S2'52) + A||Sr'-S2'||i 

subject to Si and S2 positive semi-definite (PSD). This is a jointly 
convex problem in S|f ^ and S2 ^. 
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2. Solution Properties 

There is a vast literature on using norms to induce sparsity. In 
this section we will inspect the optimality conditions for our particular 
problem to gain some insight. We begin by reparametrizing objective 
(jTr)) in terms of A = (S^^ - H^^) /2, and 6 = (S^^ + Eg ^) /2 



miuA.e - ni logdet (A + 6) + Ui tr([A + 0] Si) - ria logdet (0 - A) 

(9) +n2tr([0- A]S2) + A||A||i 

To find the Karush-Kuhn optimality conditions, we take the subgradi- 
ent of this expression and set it equal to 0. We see that 

(10) - rii (^A + ©) ~' + rii^i - (e - a) + ^252 + A9(A) = 
and 

(11) -ni{k + + + 712 (A - e)"' - 712^2 = 

where A and 6 minimize the objective and i9(A) is a p by p matrix 
with 

sign(A)ij, if Ajj ^ 
e [-1,1], if A,,, =0 

Now, we can substitute ^ and £3'^ back in to the subgradient equa- 
tions: 

(12) m [Si - Si) - n2 {S2 - S2) + A9(Sr' - S2 ') = 
and 

niSi+n2S2 niEi+ 722X2 



5(A).: 



(13) S, 



pool 



ni + 7l2 Til + 772 



We find these optimality conditions curious as they directly involve 
rather than S^^. Equation (12) shows that the solution will have a 



sparse difference — S2 ^. Though somewhat unintuitive, it parallels 
the KKT conditions for the Lasso and other £1 penalized problems. In 
particular, because the subgradient of || A||i can take a variety of values 
for Aj j = 0, the optimality conditions are often satisfied with Aj j = 



for many Equation (13) shows us that the pooled average of our 
estimates is unchanged (Spool = Spool). Given the form of our penalty 
we find it interesting that the pooled average of the is constant (in- 
dependent of A) rather than some convex combination of the S^^. 
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From these optimality conditions one can easily find the optimal so- 
lutions at both ends of our path (for A = and A sufficiently large). If 
5*1 and 5*2 are full rank, then for A = the optimality conditions are 
satisfied by the QDA solution with d = 0, and for A > Xtextrmmax = 
nin2 \\Si — 5*2 1 1 {rii + the conditions are satisfied by the LDA so- 
lution with d = nin2{Si — 5'2)/[A(ni +n2)]. In Section [sj we give a 
pathwise algorithm to fit £i-PDA along our path of A-values from Amax 
to 0. 

2.1. When is the problem ill posed? Recall that if Si or 5*2 is not 
full rank, then the QDA solution is undefined. In our case one can 
see that as A — )■ we still have this difficulty, however for A > 0, so 
long as S'pooi = (^i^*! -|- n2S'2) / (ni -|- n2) is full rank, our solution is 
well defined. In the case that S'pooi is not full rank, then the solution 
is ill-defined for all A. 



3. Forward Vs Backward Model 

So far we have assumed a model in which the x-values are gener- 
ated given the class assignments. We will henceforth refer to this as 
the "backward generative model" or backward model. Many other ap- 
proaches to classification use a "forward generative model" wherein we 
consider the class assignments to be generated from the x-values (eg. 
logistic regression). Our backward model has a corresponding forward 
model. By Bayes theorem we have 

TTiexp (/i) 



F{y = l\x) 



7i2 exp {I2) + TTi exp (/i) 

exp [log(7ri/7r2) + h- k] 
1 + exp [log(7ri/7r2) + h- k] 



where 



We can simplify this to get a better handle on it. Some algebra gives 
us 

(14) logit [F{y = l\x)] = log(7ri/7r2) + /iJS2-V2/2 - /i7SrVi/2 

(15) + (S^Vi - V2)^ x + x'^ (S2 1 - S^i) x/2. 

where logit(p) = p/{l — p). This is just a logistic model with interac- 
tions and quadratic terms. In general a logistic model takes the form 

logit [P{y = l\x)] = /3o + ^ l^iXi + ^ jijXiXj 
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or in matrix form 

(16) logit [F{y = l\x)] = l3o + x + 



So our forward generative model in (14) is a logistic model with 
/3o = log(7ri/7r2) + /iJSg V2/2 - Ai7S^Vi/2 
/3 = S^Vi - ^2 V2 

r/2 = S2 1 - sr^ 

Note, that with LDA we estimate T to be identically 0, with QDA F 
is entirely nonzero, and with £i-PDA, F has both zero and nonzero 
elements. 

3.1. Estimating Interactions. Based on the forward model above, 
one can consider our sparse estimation of F as a method for estimating 
sparse interactions. There has been a recent push to estimate inter- 
actions in the high dimensional setting ( Radchenko and James| |2010| 



Zhao et al. 2009 , among others). The basic idea is to consider a general 



logistic model as in (16) (or a linear model for continuous response), 
and to estimate /3o, and F in such a way that there are few nonzero 
entries in F (often the diagonal is constrained to be 0). The simplest 
of these approaches maximize a penalized logistic log-likelihood 

n 

argmax^ r ^ {z/(0 log(w) + - V {I)) ^og{l - pi)] - A||F||i 

i=l 

s.t. log ( 7^ ) = /3o + l^^xi + x]Txil2 



,1 -Pi, 

As we have shown, for discriminant analysis considered as a forward 
model, nonzero off-diagonal terms in F = S2 ^ — S^^^ correspond to pairs 
of variables with interactions. Thus ^i-PDA estimates a logistic model 
with sparse interactions (and quadratic terms). £i-PDA differs from 
other methods because it has additional distributional assumptions on 
the covariates which in turn put constraints on our estimates of /3o, /3, 
and F, but the underlying idea is the same. 



3.2. Linear Vs Quadratic Decision Boundaries. The sparsity of 
F again shows up if we consider the decision boundaries of discriminant 
analysis. For each method (LDA, QDA and £i-PDA), once the param- 
eters are estimated, W is partitioned into two connected spaces — one 
space where the estimated posterior probability of an observation is 
higher for class 1 and another space where it is higher for class 2. The 
decision boundary is D = {x \ P(?/ = l|a;) = 0.5} which is equivalent to 
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{x I logit [P{y 
framework 



l|x)] = 0}. Referring back to our forward generative 
(14), we see that 



D 



X 



/3o + + x^rx = 



The nonzero terms in f = S2 ^ — ^ correspond to pairs of dimensions 
in which the decision boundary is quadratic rather than hnear. As ex- 
pected, LDA has a hnear decision boundary, and QDA has a quadratic 
decision boundary (with all cross terms included). £i-PDAis a hybrid 
of these — it is quadratic in some terms and linear in others. 

4. Comparisons 

A number of other methods have been proposed for discriminant 
analysis using sparsity and pooling. These methods are useful, but 
fill a different role than £i-PDA. We will compare 2 of these ideas to 
£i-PDA and discuss when each is appropriate. 



4.1. RDA. Regularized Discriminant Analysis Friedman, 1989 esti- 
mates the within class covariance matrices as a convex combination of 
the LDA and QDA estimates. Like £i-PDAit gives a path from LDA 
to QDA. In contrast RDA is basis equivariant (changing the basis on 
which you train will not change the predictions), while ^i-PDAis not. 
In RDA, one uses a common idea in empirical bayes and stein estima- 
tion — we often overestimate the magnitude of extreme effects, in our 
case we overestimate the extremity of largest and smallest eigenvalues 
of Si — S2, so RDA shrinks these values. On the other hand,£i-PDA is 
very basis specific. In £i-PDA, as in all sparse signal processing, we 
believe we have a good basis (in our case, we believe that the differ- 
ences are sparse in this basis) and would like to leverage this in our 
estimation. 



4.2. Sparse LDA 

"sparse LDA." (iDudoit et al 



A number of methods have been proposed for 
[2002 



Bickel and Levina 2004 , Witten 



and Tibshirani 2011|, among others). These methods either assume 



diagonal covariance matrices and look for sparse mean differences, or 
assume Si = S2 and (either implicitly or explicitly) look for sparsity in 
S~^ (yUi — fi2)- This gives a linear decision rule which uses only few of 
the variables. These methods are well suited to very high dimensional 
problems (they require many fewer observations than LDA). 



In contrast £i-PDA does not remove variables — it only shrinks de- 
cision boundaries from quadratic to linear. It is not well suited to very 
high dimensional problems. In particular, the solution is degenerate if 
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p > rii + n2, but it will generally perform better than sparse LDA for 
p <ni + n2. 



To draw another parallel to logistic regression (as in Section 3.1), 
Sparse LDA is similar to sparse estimation of main effects (with no in- 
teractions), while £i-PDAis similar to sparse estimation of interactions 
(with all main effects included). 

5. Optimization 

One of the main attractions of this criterion is that it is a convex 
problem and hence a global optimum can be found relatively quickly. 
In particular we have developed a method which can solve this for up to 
several hundred variables (though the accuracy in poorly conditioned 
larger problems can be an issue). 

First, for ease of notation we introduce new variables: let A = 
B = ^, Sa = Si, and Sb = S2- If we plug in the sample means for 
/ii and /i2, our new criterion (negated for convenience) is now 

(17) minyi_B — ni logdet A + rii ti^ASA) — '"■2 logdet B 

(18) +n2tT{BSB) + X\\A- B\\i 

subject to A, B PSD, where ni is the number of observations in group 
1, n2 is the number of observations in group 2. Recall that this is con- 
vex in A and B. 



One could solve this using interior point methods discussed in Boyd 



and Vandenberghe 2004 . Unfortunately, for semi-definite programs 
the complexity of interior point algorithms scales like p^, making this 
approach impractical for p larger than 15 or 20. Instead we develop 
an approach based on the alternating direction method of moments 
(ADMM) which scales up to several hundred covariates. 

5.1. ADMM Algorithm. ADMM is an older class of algorithms which 



has recently seen a re-emergence largely thanks to Boyd et al. 2010]. 
Our particular algorithm is a adaptation of their ADMM algorithm 
for sparse inverse covariance estimation. The motivation for this algo- 
rithm is simple — the combination of a logdet term and a || ■ ||i term 
makes our optimization difficult, so we split the 2 up and introduce 
an auxiliary variable C = A — B and a dual variable F. We leave the 
details of developing this algorithm to the appendix (though they are 
straightforward). The exact algorithm is 

(1) Initialize Aq, Bq, Cq, and Fq and choose a fixed p > 
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(2) Iterate until convergence 
(a) Update A by 

Ak+i = UAU^ 

where p {Ck + + V k) — riiS a = U DU^ is its eigenvalue 
decomposition (with D = diag{di)), and A is diagonal with 



~ _ di + df + April 

An — - 

2p 

(b) Update B by 

Bk+i = VBV^ 

where p (A^+i — Ck — r^)— ^25'^ = VEV~^ is its eigenvalue 
decomposition (with E = diag{ei)) and B is diagonal with 

~ Ci + + 4pn2 

Bii = 

2p 

(c) Update C by 

Ck+i = Sa/p (^fc+i — -Bfc+i — Tfc) 
where S\{-) is the element- wise soft thresholding operator 
SxiZ)ij = sign (Zij) max {\Zij \ - A, 0) 

(d) update F by 

Tfc+i = Ffe + p (Cfc+i — Ak+i + Bk+i) 

Upon convergence, A* and i?* are the variables of interest (the rest 
may be discarded). The complexity of each step of this algorithm is 
dominated by the eigenvalue decompositions, each of which require 
0{p^) operations. 

6. Path-wise Solution 

Often we do not know a-priori what value our regularization param- 
eter should be and would like to fit the entire path from Xmax (corre- 
sponding to the LDA solution) to A = (corresponding to the QDA 
solution). We define 



ni + n2 

It is easy to see that for A > Xmax, Si = S2 = (our LDA 



see that V = "i"^^ 1+^2) optimal dual variable for A > A^ 

r).i ^ — '' 



+S2) 



solution) satisfies (10) and (11), and thus is our solution. One can also 
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To solve along a path we start at A = Xmax, and plug in our known 
solution. We then decrease A and solve the new problem, initializing 
our algorithm at the previous Si, S2, and F. Because A changes only 
slightly (and thus our solution changes only slightly), this approach is 
very efficient as compared to solving from scratch at each A. When Sa 
and Sb are full rank our QDA solution is well defined and it is possible 
to run our path all the way to A = 0. Due to convergence issues along 
the potentially poorly conditioned end of the path (which we discuss in 
the next section) we instead choose to set Xmin = ^^max and log-linearly 
interpolate between the two (in our implementation default e value is 
0.01). 

6.1. Convergence Issues. While ADMM is a good algorithm for find- 
ing an near exact solution, it is not considered a great algorithm for 
an exact solution (though it does eventually converge to arbitrary tol- 
erance, this may require an unwieldy number of iterations). In our 
application, solving to machine tolerance is unnecessary (the statisti- 
cal uncertainties are much greater than this). However, in some cases 
(especially with p ~ ni + n2), near the end of the path our solution 
converges extremely slowly. Unfortunately there is no simple fix for 
this (more accurate interior point algorithms don't scale beyond 15 or 
20 variables). While not ideal, this does not overly concern us — con- 
vergence is slow in the region where — ^ is not very sparse (a 
region where we expect £i-PDA to perform poorly anyways). We will 
see an example of this issue arise later in Section [8j 

One should also note that convergence rates near the end of the path 
are highly dependent on our choice of p. This is characteristic of all 
ADMM problems. To date, finding a disciplined choice of p for ADMM 
is still an open question. We use p = 1 as our default, as it appears to 
work reasonably well for a range of problems. 

7. Simulated and Real Data 

To show its efficacy, we applied £i-PDAto real and simulated data. 
In both cases we compare our method to linear, quadratic and regular- 
ized discriminant analysis and show improvement over both in overall 
classification error and on ROC plots. In all problems £i-PDA was ffi 
with 30 lambda values log-linearly interpolating Amax and 0.01 * Amax- 
RDA was ffi with 30 equally spaced A-values between and 1. 

7.1. Simulated Data. We simulated data from the two class gaussian 
model described in Section [T] with p = 30 covariates. We set Si = J^xp 
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ROC curve 




Sensitivity 



Figure 1. Average ROC curve for simulated data with 
Uk = 33, c = 0.9 

and 

L___o___ A 

]Ap-5)x(p-5) J 

where C is 5 x 5 matrix with constant value c on the off diagonal entries, 
and 1 on the diagonal. We also set a small mean difference between 
the groups: /ii = 0^ 

where A is a 10-vector of Is 

Under this model S]~^ — E^^ is nonzero only in the upper left 5x5 
submatrix. We simulated using varying numbers of observations rii = 
n2 G (33, 40, 60), and values of c G (0.3, 0.6, 0.9). We used 3 data 
sets for each simulation — one to fit the initial model, one to choose 
the optimal value of A and our final set to get an unbiased estimate of 
misclassification error. 




As you can see from Table 7.1 , when the signal to noise ratio (SNR) is 
too small £i-PDA adaptively shrinks towards LDA and sees similar per- 
formance. When SNR is sufficiently large (the third group in the table), 
£i-PDAis able to pick out the nonzero entries and achieves substan- 
tially better misclassification rates. In these cases RDA also does fairly 
well (adaptively choosing between LDA and QDA), however because it 



DISCRIMINANT ANALYSIS WITH ADAPTIVELY POOLED COVARIANCE 13 







#01 
per C 

33 


Dservai 
iroup 

40 


tions 
(ni.) 

60 


c = 0.3 


^i-PDA 
LDA 
QDA 
RDA 


0.82 
0.82 
0.58 
0.81 


0.85 
0.85 
0.65 
0.85 


0.88 
0.88 
0.74 
0.88 


c = 0.6 


£i-PDA 
LDA 
QDA 
RDA 


0.82 
0.82 
0.59 
0.81 


0.83 
0.84 
0.66 
0.83 


0.87 
0.86 
0.76 
0.86 


c = 0.9 


£i-PDA 
LDA 
QDA 
RDA 


0.84 
0.80 
0.65 
0.81 


0.88 
0.83 
0.76 
0.84 


0.92 
0.85 
0.86 
0.88 



Table 1. Average % of correct classifications over 100 
simulated datasets (standard errors for all entries are less 
than 0.006) 



does not take sparsity into account, it is outperformed by £i-PDA. We 
consider the large SNR case more carefully in Figure [T] (an ROC curve 
for rik = 33, c = 0.9) Again we used 3 data sets per realization to get 
an unbiased curve estimate (and ran 100 random realizations, though 
only average is shown on Figure [T| . We estimated AUG for each pro- 
cedure: £i-PDA 0.924 ±0.002, LDA 0.875 ±0.003, QDA 0.732 ±0.007, 
and RDA 0.887 ± 0.003. £i-PDAdoes substantially better than LDA, 
QDA, and RDA. With p = 30 and Uk = 33 there is clearly not enough 
data for QDA to perform well (though the sample correlation matrices 
are still invertible). However, as noted, £i-PDAalso has a large edge 
over LDA and RDA. 



14 



NOAH SIMON AND ROB TIBSHIRANI 




Figure 2. Plot of validated prediction accuracy for reg- 
ularization path in 58 mines/rocks, with Xmin = O.OlAmax 
for ^i-PDA, and ROC curve for An 

8. Real Data 



We also applied £i-PDAto the "Sonar, Mines vs. Rocks" data |Gor- 



man and Sejnowski, 20101. This dataset has 60 sonar signals measured 



on each of 208 objects (each labeled as either a rock or a mine). We 
randomly chose 150 Mines/Rocks to train with, and then classified the 
remaining 58. 

As one can see from Figure [2| £i-PDA performs better on this data 
than either LDA, QDA or RDA. Estimated true classification rate 
peaks near the middle of our regularization path, showing that a fair 
amount of regularization can significantly improve classification. As we 



mentioned in Section 6.1 one can see convergence issues near the end of 
our path — we would expect the CV error at our 30th A- value to nearly 
match that of QDA (nearly rather than exactly because we don't run 
to Amin = 0). However, it does not, indicating that our solution is not 
converging to the QDA solution. This does not overly concern us as 
our validation error reaches its crest well before this. 
We also see an ROC curve comparing ^i-PDA, LDA, QDA, and RDA. 
For RDA we chose the simplest model which maximized predictive ac- 
curacy (the 21st A value), and for ^i-PDAthe tenth A value, the most 
regularized model before a precipitous drop in predictive accuracy (so 
as to minimize bias for £i-PDA). The £i-PDA curve may still be slightly 
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biased as we chose it from a section of our path seen to do well in over- 
all classification error (though not the peak). Nonetheless, this curve 
appears indicative of an advantage from £i-PDAover LDA, QDA, and 
RDA. 



9. Discussion 

In this paper we proposed £i-PDA, a classification method for gauss- 
ian data which adaptively pools the precision matrices. We motivated 
our method, and showed connections between it and estimating sparse 
interactions. We gave two efficient algorithms to fit have £i-PDA, and 
have shown its efficacy on real and simulated data. We have made and 
plan to provide an R implementation for £i-PDA publically available 
on CRAN. 



10. Appendix a 
We include a short overview of the ADMM algorithm. We can rewrite 



(|T7j) as 

mm.A,B — ni logdet A + rii ti^ASA) — n2 logdet B 
+ n2tT{BSB) + X\\C\\i 
s.t. C = A-B 

At the optimum we have C = A — B, so this is equivalent to 

(19) mmA^B,c " ^1 logdet A + rii tr^ASA) — ■"-2 logdet B 

(20) + n2tT{BSB) + A||C||i + ^\\C-A + B\\l 

(21) s.t. C = A-B 

p can be any fixed positive number (though its choice will affect the 
convergence rate of algorithm). We will motivate this addition shortly. 
Now, using strong duality, we can move our contraint into the objective, 
and finally arrive at 

(22) 

maxr min^^s,c — '"■i logdet A + rii ti^ASA) — n2 logdet B 

(23) + tr(55B) + A||C||i +p trace [V^ {C-A + B 

(24) +^\\C-A + B\\l 
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For ease of notation we denote 

ipriA B, C) = -rii logdet A + rii tr(AS'^) - ris logdet B 

+ 712 tT{BSB) + A| |C| |i + p trace [r^ {C-A + B)] 
+ ^\\C-{A-B)\\l 

and 

M{T) = mmA,B,cMAB,C). 

Now, by basic convex analysis, the dual of any strongly convex function 
(with convexity constant p) is differentiable and its derivative has lips- 



chitz constant p. Unfortunately (19) is not necessarily strongly convex, 
however the addition of ||C — A + BWjp, affords it many of the same 
properties. In particular if C*, A*, B* are the argmin of tp^^ for a given 
Fq, then 

d 

M =C* -A* + B* 

To 

If we could easily calculate M(F), then we could use gradient ascent 
on F 

r.+i = Ffc + p{Cl -Al + Bl) 
and one would have and B^ converging to the argmax of our original 
problem (|4]). Unfortunately, M(F) is not easy to calculate, however ■i/'r 
is relatively simple to minimize in one variable at a time (A, B, or 
C) with all other variables fixed. In ADMM we employ the same idea 
as gradient descent, only we fudge the details — instead of actually 
calculating M(F), we minimize first in A, with 5, and C fixed, then in 
B with A and C fixed and finally in C with A and B fixed. After these 
3 updates, we take our "gradient" step as before (though this time it is 
not a true gradient step). This leads to the following algorithm: 

(1) Initialize Aq, Bq, Cq, and Fq 

(2) Iterate until convergence 

(a) Update A by 

Ak+i = argmin^ Vr^ {A, Bk,Ck) 

(b) Update B by 

Bk+i = argmmBi^rki^k+i, B,Ck) 

(c) Update C by 

Ck+i = axgmmc ipTki^k+i, Bk+i, C) 

(d) Take "gradient step"; update F by 

Tfc+i = Tfe + P (Cfc+i — Ak+i + Bk+i) 
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One may note that if we instead iterate steps a — c to convergence each 
time before taking step we end up again with gradient descent. 

10.1. Inner Loop Updates. In this section we derive the exact up- 
dates for A, and C in steps a, h and c of our ADMM algorithm. We 
begin with A: to find argmin^ ■?/'rj.(A, -Bfc, C'fc) we must minimize 

- rii logdet A + rii ti{ASA) + p trace [rj {Ck - A + Bk)] 

+ ^\Ck — A + Bk\\\ 

If we take the derivative of this and set it equal to we get 

(25) pA - n^A-^ = p{Ck + Bk + T^) - u^Sa 

Now if we let p{Ck + Bk + Tk) — tiiSa = UDU^ be its eigenvalue 
decomposition (with D = diag{di)), then (25) is satisfied by 

A = uAu^ 

where A is diagonal and 

~ _di + ^Jd\ + 4pni 
2p 

We can solve for Bk+\ similarly. Let p (A^+i — Ck — Tk) — n2SB = 
VEV~^ be its eigenvalue decomposition (with E = diag{ei)). Then 
eLTgmmBiprk{Ak+i,Bk,Ck) is 

B = VBV^ 

where B is diagonal and 

~ ei + + 4pn2 
2p 

The last variable to solve for is C. Ignoring all terms without a C, we 
need to minimize 

2 

This is equivalent to minimizing 



A||C||i + ptrace [F^ {C - A + B)] + \\C ~ A + B\ 
to minimizing 

^||c-(A,+i-B,+i-r)||^ + -||C||i 

2 p 



which is solved by 

C = Sa/p (^fc+i — -Bfc+i — r) 

where Sa/p is the entry-wise soft thresholding operator on the entries 
of the matrix. For i ^ j 

Sx/piX)ij = sign (Xij) max {\Xij\ - \/p, 0) 
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So, in full detail, our algorithm is 

(1) Initialize ^40, Bq, Cq, and Tq 

(2) Iterate until convergence 

(a) Update A by 

where p (C^ — 3^ + Ffe) — UiSa — UDU^ is its eigenvalue 
decomposition (with D — diag{di)), and A is diagonal with 

~ _ d, + y/ (/j + \pni 

(b) Update B by 

Bk+i = VBV^ 

where p (Afe+i — Ck — Ffe)— n25'B = VEV^ is its eigenvalue 
decomposition (with = diag{ei)) and S is diagonal with 

(c) Update C by 

C'fe+i = Sa/p (^fc+i — -Bfe+i — Ffe) 

(d) update F by 

Ffc+i = Ffc + p (Cfc+i — + Bk+i) 

The complexity of each step of this algorithm is dominated by the eigen- 
value decompositions, each of which require 0{p^) operations. For this 
reason, while the algorithm can solve problems for p in the hundreds, it 
will be difficult to scale to larger problems. One should note that p in 
the hundreds is already an optimization problem with tens of thousands 
of variables. 
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